Background: Identification of causal SNPs in most genome wide association studies
relies on approaches that consider each SNP individually. However, there is a
strong correlation structure among SNPs that needs to be taken into account. Hence,
increasingly modern computationally expensive regression methods are employed
for SNP selection that consider all markers simultaneously and thus incorporate
dependencies among SNPs.

Results: We develop a novel multivariate algorithm for large scale SNP selection
using CAR score regression, a promising new approach for prioritizing biomarkers.
Specifically, we propose a computationally efficient procedure for shrinkage
estimation of CAR scores from high-dimensional data. Subsequently, we conduct a
comprehensive comparison study including five advanced regression approaches
(boosting, lasso, NEG, MCP, and CAR score) and a univariate approach (marginal
correlation) to determine the effectiveness in finding true causal SNPs.

Conclusions: Simultaneous SNP selection is a challenging task. We demonstrate
that our CAR score-based algorithm consistently outperforms all competing approaches,
both uni- and multivariate, in terms of correctly recovered causal SNPs
and SNP ranking. An R package implementing the approach as well as R code to re-
Background



Genome-wide association studies (GWAS) are now routinely conducted to search for
genetic factors indicative of or even causally linked to disease. Typically, the aim of
such a study is to identify a small subset of single nucleotide polymorphisms (SNPs)
associated with a phenotype of interest. From an analysis point of view the screening for
relevant biomarkers is best cast as a problem of statistical variable selection. In GWAS
variable selection is very challenging as the full set of SNPs is often very large while
both the effect of each potentially causal SNP and their number are very small (e.g.
Hoggart et al. 2008; Ayers and Cordell 2010; Guan and Stephens 2011).

To date, most GWAS are based on single-SNP analyses where each SNP is considered
independently of all others and association with the phenotype is computed using a
univariate test statistic such as variants of the t-score, the ATT statistic (Armitage 1955),
or marginal correlation (Foulkes 2009). The advantage of this approach is that it is
computationally inexpensive. However, it implicitly assumes complete independence of
markers and thus ignores the correlation structure among SNPs, e.g., due to linkage or
interaction among SNPs.

In order to increase statistical efficiency and to exploit the correlation among predictive
SNPs several authors have recently started to investigate simultaneous SNP selection
using fully multivariate approaches. This was pioneered for GWAS in the seminal paper
of Hoggart et al. (2008) that introduced the NEG regression model, a shrinkage-based
approach to select relevant SNPs. Related approaches are LASSO regression, which was
applied to GWAS by Wu et al. (2009), MCP regression (Ayers and Cordell 2010), and
Bayesian variable selection regression (Guan and Stephens 2011). Another promising
multivariate approach advocated for high-dimensional variable selection is boosting
(Hothorn and Bühlmann 2006) but this has not yet been investigated for GWAS.

Recently, to address the problem of variable importance and selection under correlation
in genomics, we have introduced two novel statistics, the correlation-adjusted
t-score (CAT score) and the correlation-adjusted marginal correlation (CAR score), see
Zuber and Strimmer (2009, 2011). These two measures are multivariate generalizations
of the standard univariate test statistics that take the correlation among variables explicitly
into account and lead to improved rankings of markers, as has been shown for data
from transcriptomics and metabolomics. However, application of CAT and CAR scores
has so far been restricted to medium to large dimensional settings only, as computing
these scores involves the calculation of the inverse matrix square root of the correlation
matrix, which is prohibitively expensive in high dimensions. Thus, for SNP analyses
further computational economies are needed.

Here, we develop a novel multivariate algorithm for large scale SNP selection using
CAR score regression. Specifically, we propose a computationally efficient procedure
that allows for shrinkage estimation of CAR scores even for very high-dimensional data
sets. Subsequently, we conduct a systematic comparison of state-of-the-art simultaneous
SNP selection procedures using data from the GAW17 consortium (Almasy et al. 2011).
These data are particularly suited for investigating relative performance as the true
causal SNPs are known. Finally, we demonstrate that SNP rankings based on
correlation-adjusted statistics consistently outperform all investigated competing approaches, both
uni- and multivariate.



Methods 

Univariate ranking of SNPs 

The basic setup we consider here is a linear regression model for a set of d predictors
$X = \{X_1, \ldots, X_d\}$ and a metric or binary response variable Y. In GWAS the covariates
X are given by the genotype and the response Y is the phenotype or trait of interest. The
correlation matrix among the predicting variables has size d × d and is denoted by P
(capital "rho"). The vector of marginal correlations $P_{XY} = (\rho_{X_1 Y}, \ldots, \rho_{X_d Y})^T$ contains the
correlations between a metric response and each individual SNP. Similarly, for binary
response the t-score vector $\tau = (\tau_1, \ldots, \tau_d)^T$ contains the t-scores computed for each
variable.

If there is no correlation among SNPs (i.e. $P = I_d$) the t-scores $\tau$ provide an optimal
ranking of SNPs in terms of predicting a binary Y (Efron 2009). Likewise, for metric
response the marginal correlations lead to an optimal ordering (Fan and Lv 2008).
Moreover, in the absence of SNP-SNP correlation the squared values of the ranking
statistics (squared t-score, squared marginal correlation) are useful measures of variable
importance, adding up to Hotelling's $T^2$ and the squared multiple correlation coefficient
$R^2$, respectively.

CAT and CAR score 

In many important settings the correlations P do not vanish but rather represent additional
structure relating the predictors. In the case of SNPs the correlation may be rather
large, e.g. due to linkage effects (Ardlie et al. 2002). Thus, both for variable ranking
and for assigning variable importance it can be essential to take the correlation between
covariates into account.

To this end we have proposed a simple modification of the t-statistic and marginal
correlations. In Zuber and Strimmer (2009) we have introduced the CAT score
(correlation-adjusted t-score) that is defined as

$$\tau^{\text{adj}} = P^{-1/2} \tau , \qquad (1)$$

where $P^{-1/2}$ is the inverse of the matrix square-root of P. The vector $\tau^{\text{adj}}$ contains the
adjusted t-scores which measure the influence of each predictor on Y after simultaneously
removing the effect of all other variables. The squared CAT score may thus be used as a
measure of variable importance. Unlike squared t-scores they sum up to Hotelling's $T^2$
even in the presence of correlation,

$$(\tau^{\text{adj}})^T \tau^{\text{adj}} = \tau^T P^{-1} \tau = T^2 .$$



Correspondingly, in Zuber and Strimmer (2011) we investigated correlation-adjusted
marginal correlations (CAR scores)

$$P^{\text{adj}}_{XY} = P^{-1/2} P_{XY} . \qquad (2)$$

The squared CAR scores sum up to the squared multiple correlation coefficient

$$(P^{\text{adj}}_{XY})^T P^{\text{adj}}_{XY} = P_{XY}^T P^{-1} P_{XY} = R^2 ,$$

also known as coefficient of determination or proportion of variance explained. Because
of this decomposition property CAT and CAR scores allow importance to be assigned not
only to individual SNPs but also to groups of SNPs. Moreover, both CAT and CAR scores
share a grouping property that leads to similar scores for highly correlated SNPs. In
addition they protect against antagonistic SNPs, i.e. if two SNPs are highly correlated
and one has a protective and the other a risk effect, then both SNPs are assigned low
scores.
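
The decomposition property can be checked numerically. The following is a minimal sketch in Python with NumPy, not the implementation used in the care package; the helper names `inv_sqrt_sym` and `car_scores` and the simulated data are illustrative assumptions. It computes empirical CAR scores for a small problem with n > d, so that no shrinkage is needed, and compares the sum of the squared CAR scores with $P_{XY}^T P^{-1} P_{XY}$.

```python
import numpy as np

def inv_sqrt_sym(P):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    eigval, eigvec = np.linalg.eigh(P)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

def car_scores(X, y):
    """Empirical CAR scores P^{-1/2} P_XY (assumes n > d so that P is invertible)."""
    n = X.shape[0]
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized predictors
    z = (y - y.mean()) / y.std()               # standardized response
    P = Z.T @ Z / n                            # d x d correlation matrix
    p_xy = Z.T @ z / n                         # vector of marginal correlations
    return inv_sqrt_sym(P) @ p_xy, p_xy, P

# Simulated toy data: predictors 0 and 1 are strongly correlated.
rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
X[:, 1] = 0.8 * X[:, 0] + 0.6 * rng.normal(size=n)
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)

omega, p_xy, P = car_scores(X, y)
r2 = p_xy @ np.linalg.solve(P, p_xy)           # squared multiple correlation

print("sum of squared CAR scores:", np.sum(omega ** 2))
print("squared multiple correlation:", r2)
```

The two printed values agree up to rounding error, whereas the sum of the squared marginal correlations generally differs from $R^2$ when predictors are correlated.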

For model selection using CAT and CAR scores, i.e. for identification of those SNPs
that do not contribute to predicting the response Y, we use a simple thresholding procedure
with the critical threshold obtained by controlling local false discovery rates (Klaus and
Strimmer 2012).



In previous work we have shown for synthetic data as well as for data from metabolomic
and gene expression experiments that CAT and CAR scores are effective multivariate
criteria for obtaining compact yet highly predictive feature sets. Independently, in the
study of Allen and Tibshirani (2012) it was also found that CAT scores result in favorable
orderings of variables.

However, with increasing dimension d the correlation matrix P becomes prohibitively
large both to compute and to handle effectively. As a result, in high dimensions direct
calculation of CAT and CAR scores using Eq. 1 and Eq. 2 is not possible. Thus, for
application to high-dimensional data such as from GWAS an alternative means of
computation must be developed.

Computationally efficient calculation of shrinkage estimators of CAT and 
CAR scores 

If the number of observations n is smaller than the number of variables d we need to
employ a regularized estimate for the correlation matrix P. A simple shrinkage estimator
R for P is given by

$$R = \lambda I_d + (1 - \lambda) R_{\text{empirical}}$$

where $R_{\text{empirical}}$ is the empirical non-regularized correlation matrix and $\lambda$ is a shrinkage
intensity (e.g. Schäfer and Strimmer 2005). Using computational economies akin to
those discussed in Hastie and Tibshirani (2004) we now show that computation of
$R^{-1/2}$ and subsequent calculation of estimates of CAT and CAR scores can be done in a
computationally highly effective way, even when direct computation of CAT and CAR
scores via Eq. 1 and Eq. 2 is infeasible.



Using singular value decomposition the empirical correlation matrix can be written
$R_{\text{empirical}} = \frac{\lambda}{1 - \lambda} U M U^T$ where M is a positive definite matrix of size m × m, U an
orthonormal matrix of size d × m, and $m = \mathrm{rank}(R_{\text{empirical}}) \ll d$. This simplifies the
shrinkage estimator to

$$R = \lambda (I_d + U M U^T) .$$



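Following Zuber and Strimmer (2009) we then compute the $\alpha$-th matrix power of R
using

$$R^{\alpha} = \lambda^{\alpha} \bigl( I_d - \underbrace{U}_{d \times m} \underbrace{(I_m - (I_m + M)^{\alpha})}_{m \times m} \underbrace{U^T}_{m \times d} \bigr) .$$

This implies we only have to compute the matrix power of the m × m matrix $I_m + M$ to
obtain $R^{\alpha}$. Moreover, for efficiently calculating CAT and CAR scores it is crucial to note
that it is necessary neither to store nor to compute the full d × d matrix
$R^{-1/2}$, as

$$\begin{aligned}
R^{\text{adj}}_{XY} &= R^{-1/2} R_{XY} \\
&= \lambda^{-1/2} \bigl( I_d - U (I_m - (I_m + M)^{-1/2}) U^T \bigr) R_{XY} \qquad (3) \\
&= \lambda^{-1/2} \bigl( \underbrace{R_{XY}}_{d \times 1} - \underbrace{U (I_m - (I_m + M)^{-1/2})}_{d \times m} \underbrace{(U^T R_{XY})}_{m \times 1} \bigr) .
\end{aligned}$$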

Consequently, Eq. 3 makes it possible to obtain shrinkage estimates of CAT and CAR scores
efficiently even in high dimensions, as none of the matrices employed in Eq. 3 is larger than
d × m, and most are even smaller (d × 1 or m × 1), all without actually computing the
shrinkage correlation matrix R.
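
The computation in Eq. 3 can be sketched as follows. This is an illustrative Python/NumPy sketch under stated assumptions, not the code of the care package: the function name `shrinkage_car` is hypothetical, the shrinkage intensity is supplied by the user, and $R_{XY}$ is taken to be the vector of empirical marginal correlations. U and M are obtained from a thin SVD of the standardized data matrix, so no d × d matrix is formed inside the function. For a small d the result is compared against the direct computation via the full matrix R.

```python
import numpy as np

def shrinkage_car(X, y, lam):
    """Shrinkage CAR scores following Eq. 3, without forming any d x d matrix."""
    n, d = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0)    # standardized predictors
    z = (y - y.mean()) / y.std()                # standardized response
    r_xy = Z.T @ z / n                          # d x 1 marginal correlations
    # Thin SVD of Z / sqrt(n): R_empirical = V diag(s^2) V^T
    _, s, Vt = np.linalg.svd(Z / np.sqrt(n), full_matrices=False)
    keep = s > 1e-10 * s.max()                  # retain the m non-zero singular values
    U = Vt[keep].T                              # d x m with orthonormal columns
    m_diag = (1 - lam) / lam * s[keep] ** 2     # diagonal of M
    w = 1 - (1 + m_diag) ** -0.5                # diagonal of I_m - (I_m + M)^(-1/2)
    return lam ** -0.5 * (r_xy - U @ (w * (U.T @ r_xy)))

# Toy data with n << d (simulated, for illustration only).
rng = np.random.default_rng(1)
n, d, lam = 20, 200, 0.1
X = rng.normal(size=(n, d))
y = X[:, 0] - X[:, 1] + rng.normal(size=n)
fast = shrinkage_car(X, y, lam)

# Reference: direct computation with the full d x d matrix R (feasible only for small d).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
z = (y - y.mean()) / y.std()
R = lam * np.eye(d) + (1 - lam) * (Z.T @ Z / n)
eigval, eigvec = np.linalg.eigh(R)
direct = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ (Z.T @ z / n)

print("maximum absolute difference:", np.max(np.abs(fast - direct)))
```

Because M is diagonal when U is taken from the SVD, the m × m matrix power reduces to an elementwise operation on the singular values, and the largest objects held in memory are of size d × m.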

Results and Discussion 

We now compare the proposed CAR score approach to simultaneous SNP selection with 
competing methods and determine its effectiveness in finding true causal SNPs. 

For this purpose we use the mini-exome data set compiled for the GAW17 workshop
held 13-16 October 2010 in Boston (http://www.gaworkshop.org/gaw17/). This data set
is a combination of real sequence data and simulated synthetic phenotypes, where the
true causal SNPs are known. In our study we investigate univariate ranking by marginal
correlation and five multivariate approaches.



In order to facilitate replication of our results we provide complete R code (R Development
Core Team 2012). Our R package "care" implements the developed algorithm.
Moreover, we offer R scripts covering all analysis steps from preprocessing the raw
data to plotting of figures at http://strimmerlab.org/software/care/. The data are
publicly available from the GAW consortium, see http://www.gaworkshop.org/gaw17/data.html
for details.



GAW17 unrelated data

The compilation and simulation of phenotypes for the GAW17 mini-exome data set is
described in detail in Almasy et al. (2011). We focus here on the GAW17 unrelated
data with metric phenotypes Q1, Q2, and Q4. The corresponding sequence data matrix
contains information on 24,487 SNPs for n = 697 individuals. For each phenotype there
are B = 200 simulations. By construction, phenotype Q1 has a residual heritability of 0.44
and is influenced by 39 SNPs in 9 genes, whereas Q2 has a lower residual heritability of
0.29 and is influenced by 72 SNPs in 13 genes. This suggests that discovery of true causal
SNPs should be less challenging for Q1 than for Q2. Phenotype Q4 has a heritability of
0.70 but none of it is due to SNPs contained in the present data set.



Preprocessing 

In the preprocessing of the sequences we first recoded the alleles in the raw data into
0, 1, 2 assuming an additive effects model. Second, we standardized the data matrix
to column mean zero and column variance 1. Subsequently, we removed duplicate
predictors so that 15,076 unique SNPs remained. The sets of true causal SNPs for Q1
and Q2 each also contain a duplicate, reducing the number of true unique SNPs to 38
and 71, respectively. Finally, we further filtered out synonymous SNPs, as we are interested only in
non-synonymous mutations. The resulting predictor matrix X is of size 697 × 8,020, i.e.
d = 8,020 unique non-synonymous SNPs are simultaneously considered for selection.
For preprocessing the response variables Q1, Q2, and Q4 we removed the influence of
the three non-genetic covariates sex, age, and smoking by linear regression. The resulting
residuals were standardized to mean zero and variance 1, which yielded B = 200
response vectors $y_b^{Q1}$, $y_b^{Q2}$, and $y_b^{Q4}$, where $b \in 1, \ldots, B$, each of size 697 × 1.
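
The standardization, duplicate removal, and covariate adjustment described above can be sketched as follows. This is a minimal Python/NumPy illustration, not the R scripts used in the study; the function names `preprocess_genotypes` and `preprocess_response` and the toy data are hypothetical, the genotypes are assumed to be already recoded to 0, 1, 2, and monomorphic SNPs (zero variance) are assumed to be absent. The filtering of synonymous SNPs requires annotation and is not shown.

```python
import numpy as np

def preprocess_genotypes(G):
    """Standardize genotype columns and drop exact duplicate predictors."""
    Z = (G - G.mean(axis=0)) / G.std(axis=0)          # column mean 0, variance 1
    # Indices of the first occurrence of each distinct column.
    _, first = np.unique(np.round(Z, 12), axis=1, return_index=True)
    keep = np.sort(first)                             # preserve original column order
    return Z[:, keep], keep

def preprocess_response(y, covariates):
    """Regress out non-genetic covariates and standardize the residuals."""
    A = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return (resid - resid.mean()) / resid.std()

# Toy example (simulated): 50 individuals, 6 SNPs, one planted duplicate.
rng = np.random.default_rng(2)
n = 50
G = rng.integers(0, 3, size=(n, 6)).astype(float)
G[:, 4] = G[:, 1]                                     # SNP 4 duplicates SNP 1
Z, keep = preprocess_genotypes(G)

# Three covariates standing in for sex, age, and smoking status.
C = np.column_stack([rng.integers(0, 2, n),
                     rng.normal(50, 10, n),
                     rng.integers(0, 2, n)])
y = 0.02 * C[:, 1] + rng.normal(size=n)
r = preprocess_response(y, C)

print("retained SNP columns:", keep.tolist())
```

In this toy example the duplicated column is dropped and the first copy is retained, and the adjusted response is uncorrelated with the covariates by construction of the least-squares residuals.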

SNP selection methods included in the comparison study 



Table 1: Software used in the comparison study. The R packages are available from the R
software archive CRAN at http://cran.r-project.org/.

Method    Software            Reference
CAR       R package care      Zuber and Strimmer (2011)
COR       R package care      Zuber and Strimmer (2011)
NEG       HLasso program      Hoggart et al. (2008)
MCP       R package ncvreg    Breheny and Huang (2011)
BOOST     R package mboost    Hothorn and Bühlmann (2006)
LASSO     R package glmnet    Friedman et al. (2010)




For each of the B = 200 response vectors for Q1, Q2, and Q4 we computed a
regression model including all d = 8,020 SNPs as potential predictors. Following Ayers
and Cordell (2010) we focused on regularized regression approaches. Specifically, we
used the following five methods, all of which have been shown to be powerful tools for
variable selection in large-scale regression settings:



• CAR: variable ranking by shrinkage CAR scores (Zuber and Strimmer 2011),

• NEG: regression with normal exponential gamma (NEG) prior (Hoggart et al. 2008),

• MCP: regression with MCP penalty (Zhang 2010),

• BOOST: boosting (Schapire 1990), and

• LASSO: lasso regression (Tibshirani 1996).



The corresponding software implementations are listed in Tab. 1. As a reference for
comparison we additionally included two baseline methods:

• COR: univariate SNP ranking by marginal correlation, and 

• RND: random ordering of all SNPs. 

All methods except CAR and COR combine regularization with variable selection.
Thus, for determining model sizes for CAR scores and COR we adaptively estimated a
threshold from the data using a local FDR cutoff of 0.5 as recommended in Klaus and
Strimmer (2012). In settings with rare and weak features this particular choice coincides
with the so-called "higher criticism" threshold that has been shown to be powerful for signal
identification in classification (e.g., Donoho and Jin 2008, 2009; Duarte Silva 2011). For
computing the FDR values we employed the R package fdrtool (Strimmer 2008a,b).

Generally, all software was run with default settings. The regularization parameters
required by the NEG, MCP, BOOST and CAR approaches were set to fixed values
optimizing the overall performance of each method. Specifically, for CAR and MCP we
employed $\lambda = 0.1$, for BOOST $\nu = 0.1$, and for NEG $\lambda = 85$. For LASSO we used the
built-in cross-validation routines.

Relative performance of investigated methods 

The aim of this study is to compare simultaneous SNP selection methods with regard 
to their ability to discover the true known SNPs. For this purpose we investigated the 
respective SNP rankings and the corresponding true positives, the size of the selected 
models, and the variability across the 200 repetitions. 
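
The number of true positives at a given model size can be read off any SNP ranking. The following minimal Python/NumPy sketch illustrates the bookkeeping; the function name `true_positives_curve` and the toy scores are hypothetical and not taken from the study.

```python
import numpy as np

def true_positives_curve(scores, causal):
    """Number of true causal SNPs among the top-k ranked SNPs, for k = 1..d."""
    # Rank SNPs by decreasing absolute score (stable sort keeps ties in input order).
    order = np.argsort(-np.abs(scores), kind="stable")
    is_causal = np.zeros(len(scores), dtype=bool)
    is_causal[list(causal)] = True
    return np.cumsum(is_causal[order])

# Toy example: five SNPs, of which SNPs 1 and 3 are causal.
scores = np.array([0.05, -0.90, 0.30, 0.10, -0.60])
tp = true_positives_curve(scores, causal={1, 3})
print(tp.tolist())   # true positives at model sizes 1, 2, 3, 4, 5
```

Averaging such curves over the repetitions gives the average true positives at each model size, and evaluating a curve at the model size chosen by a method gives that method's true positive count.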

In Fig. 1 and the associated Tab. 2 we compare the effectiveness of SNP rankings for
phenotypes Q1 and Q2. For Q1 all methods uniformly outperform marginal correlation,
i.e. at the model size determined by each procedure the number of true positives is larger
than that for marginal correlations at the same cutoff. Thus, for Q1 all multivariate SNP
selection approaches improve over univariate selection. Moreover, as can be seen from
Fig. 1 (top row) and Tab. 2, for small numbers of included SNPs all methods perform
similarly but starting from a model size of 50 SNPs CAR scores lead to a better ranking in
Figure 1: Average true positives resulting from SNP rankings of the investigated approaches
for phenotype Q1 (top row) and Q2 (bottom row). For Q1 there are 38 true
SNPs and for Q2 71 true SNPs.



Table 2: Median model sizes and the corresponding interquartile ranges (IQR) as well
as the average true positives for phenotypes Q1 and Q2 for all investigated methods
summarized across the 200 repetitions (first three columns). For comparison, the last
three columns show the average true positives at the specified model size for CAR, COR
and RND. The best performing method is shown in bold, the second best in italic.

          Results                       Comparisons
Method    Model Size      TP            TP        TP        TP
          Median (IQR)    Method        CAR       COR       RND

Q1
CAR       51 (53)         5.85          5.85      5.42      0.23
COR       176 (108)       8.06          8.99      8.06      0.88
NEG       1390 (118)      15.31         17.57     14.38     6.60
MCP       20 (5)          4.11          4.19      3.95      0.12
BOOST     53 (5)          5.84          5.91      5.50      0.25
LASSO     37 (31)         5.19          5.21      4.89      0.18

Q2
CAR       31 (38)         2.93          2.93      2.85      0.29
COR       1 (7)           0.38          0.21      0.38      0.00
NEG       1632 (755)      20.21         28.08     25.90     14.50
MCP       29 (5)          2.75          2.82      2.76      0.28
BOOST     59 (6)          3.92          4.34      3.82      0.59
LASSO     15 (36)         1.50          1.88      1.97      0.14

terms of true positives than all other competing approaches. For the more challenging
phenotype Q2 the situation is similar. CAR scores almost always provide the most
effective ranking (see lower part of Tab. 2) but intriguingly for this phenotype it is also
the only multivariate method that improves over marginal correlation.

In Tab. 2 we also list the median model sizes for each regression approach. LASSO
and MCP generally lead to small numbers of selected SNPs (fewer than 40), BOOST,
CAR and COR variable sets are medium sized, and NEG chooses a very large number
of SNPs. Note that the variability in the estimated model sizes, as quantified by the
corresponding interquartile ranges (IQR), is largest in the methods that estimate the threshold
adaptively from the data (CAR, COR, LASSO) whereas it is smallest for those methods
where we used a fixed regularization parameter (NEG, MCP, BOOST). Finally, in Tab. 3
the model sizes and IQR for phenotype Q4 are shown for the investigated methods. Here,
COR and LASSO lead to the smallest model sizes and thus the smallest number of false
positives, with the MCP and CAR methods being the runners-up.

In further investigation of these results we identified the actual true SNPs recovered
by each SNP selection approach. Specifically, we counted which of the 38 and
71 true causal SNPs for Q1 and Q2, respectively, were found among the first 100 top ranking SNPs


Figure 2: Frequency of occurrence of each true SNP among the top 100 SNPs selected by
each approach for phenotype Q1 (top row) and for Q2 (lower row) for the 200 repetitions.
Note that the SNPs are ordered according to the first column.



Table 3: Median model sizes and the corresponding interquartile ranges (IQR) for
phenotype Q4.

Q4
Model Size    CAR     COR     NEG      MCP     BOOST    LASSO
Median        34              1900     27      59       1
IQR           40      1       2713     4       6        6



using the 200 repetitions available for each phenotype. The result is shown as a heatmap
in Fig. 2 and visualizes the relative difficulty of recovering the individual causal SNPs.
In Q1, there are two SNPs on top of the heatmap that are consistently detected by all
methods. Then, there is a large block primarily recovered by CAR score and correlation,
but not by the other approaches. Finally, there are some moderate detections only in
CAR scores and NEG regression. Half of the true positives are hardly discovered by
any method. The comparison with randomly ordered SNPs (column RND) shows that
those SNPs only appear by chance. For Q2, there is only a single SNP that is consistently
included in all models. As in Q1, it is followed by a small group of detections most
prominent in CAR score and correlation. Finally, there are some moderate findings for
both the CAR score and NEG, and some only for correlation. In addition, hierarchical
clustering of the columns (methods) in this heatmap (tree not shown in figure) reveals a
basic similarity pattern among the methods: CAR and COR cluster together, NEG and
MCP regression form another cluster, and LASSO and BOOST are grouped together.

In Tab. 4 we list the SNPs identified by the CAR score among the top 100 SNPs in
at least 50 of 200 repetitions along with their minor allele frequency (MAF) and BETA
values. We consider SNPs with a MAF value smaller than 0.01 as rare and SNPs with
a larger MAF value as common variants. The BETA value measures the effect size in
the actual simulation of the phenotype (Almasy et al. 2011). We find large differences
between true positive SNPs of the two phenotypes. Whereas Q1 is characterized by
SNPs with strong effects and moderate MAFs, the true SNPs for Q2 have a very low
MAF and are much harder to detect. Interestingly, most of the SNPs recovered by CAR
scores are rare SNPs with comparatively large BETA values. Common SNPs are found
as well, then also with small effect values. Thus, CAR scores are successful in achieving
a high true positive rate because they allow identification not only of common SNPs but also
of SNPs with small MAF if a strong signal is present (large BETA).

The last column in Tab. 4 provides information about the average absolute correlation
among all true SNPs for Q1 and Q2 as well as among the identified SNPs on the same
gene. We observe that the true positive SNPs in Q1 best identified by the CAR score
are highly correlated within the same gene. This demonstrates that the CAR score
successfully utilizes the correlation structure among SNPs to optimize the ranking. For
phenotype Q2 the correlation among the true SNPs is generally lower compared to Q1.
Still, except for BCHE, the correlation among SNPs on the same gene is larger than
the average correlation between a randomly chosen pair of true SNPs.

Finally, in Tab. 5 we provide the proportion of rare and common SNPs found among
Table 4: True SNPs found among the top 100 SNPs identified by CAR scores in at least
50 of the 200 repetitions for Q1 and Q2. The last column shows the average absolute
correlation among all SNPs for Q1 and Q2 as well as the average absolute correlation for
the SNPs belonging to one gene.

SNP                  Frequency    MAF         BETA       Correlation

Q1                                                       0.014
ARNT | C1S6533       88           0.011478    0.56190
FLT1 | C13S431       110          0.017217    0.74136    0.147
FLT1 | C13S522       200          0.027977    0.61830    0.147
FLT1 | C13S523       200          0.066714    0.64997    0.147
FLT1 | C13S524       164          0.004304    0.62223    0.147
KDR | C4S1877        145          0.000717    1.07706    0.111
KDR | C4S1878        101          0.164993    0.13573    0.111
KDR | C4S1884        95           0.020803    0.29558    0.111
VEGFA | C6S2981      69           0.002152    1.20645
VEGFC | C4S4935      91           0.000717    1.35726

Q2                                                       0.008
BCHE | C3S4869       54           0.000717    1.01569    0.001
BCHE | C3S4875       59           0.000717    1.09484    0.001
LPL | C8S442         69           0.015782    0.49459
SIRT1 | C10S3048     54           0.002152    0.83224    0.330
SIRT1 | C10S3050     72           0.002152    0.97060    0.330
VNN1 | C6S5380       138          0.170732    0.24437
VNN3 | C6S5441       59           0.098278    0.27053    0.066
VNN3 | C6S5449       57           0.010043    0.66909    0.066

the top ranking 100 SNPs for each method. This also shows that the proposed approach
based on CAR scores is effective in finding rare SNPs.

Conclusions 

Large scale simultaneous SNP selection is a statistically and computationally very
challenging task. To this end, we have introduced here a novel algorithm based on
CAR score regression that can be applied effectively in high dimensions. Subsequently,
in a comparison study we have investigated five multivariate regression-based SNP
selection approaches with regard to their ability to correctly recover causal SNPs and
corresponding SNP rankings.

As overall best method we recommend using CAR scores since this method was the 
only approach not only consistently outperforming the competing other multivariate 
SNP selection procedures in terms of identified true positives but also the only approach 
Table 5: Proportion of common and rare variants of the true SNPs found among the top
100 SNPs.

Q1
Proportion (%)    CAR     COR     NEG     MCP     BOOST    LASSO
Common            0.56    0.71    0.63    0.74    0.71     0.73
Rare              0.44    0.29    0.37    0.26    0.29     0.27

Q2
Proportion (%)    CAR     COR     NEG     MCP     BOOST    LASSO
Common            0.28    0.41    0.36    0.44    0.42     0.43
Rare              0.72    0.59    0.64    0.56    0.58     0.57

uniformly improving over simple univariate ranking by marginal correlation. In addition
we have shown that CAR scores also are successful in detecting rare variants, which
recently have been recognized to be important indicators for human disease (Bodmer and
Bonilla 2008; McClellan and King 2010).